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THE DIVERSITY OF A DISTRIBUTED GENOME IN 
BACTERIAL POPULATIONS 

By F. Baumdicker* , W. R. Hess* P. Pfaffelhuber*-' 1 ' 
Albert- Ludwigs University, Freiburg 

The distributed genome hypothesis states that the set of genes 
in a population of bacteria is distributed over all individuals that be- 
long to the specific taxon. It implies that certain genes can be gained 
and lost from generation to generation. We use the random genealogy 
given by a Kingman coalescent in order to superimpose events of gene 
gain and loss along ancestral lines. Gene gains occur at constant rate 
along ancestral lines. We assume that gained genes have never been 
present in the population before. Gene losses occur at a rate propor- 
tional to the number of genes present along the ancestral line. In this 
infinitely many genes model we derive moments for several statistics 
within a sample: the average number of genes per individual, the av- 
erage number of genes differing between individuals, the number of 
incongruent pairs of genes, the total number of different genes in the 
sample and the gene frequency spectrum. We demonstrate that the 
model gives a reasonable fit with gene frequency data from marine 
cyanobacteria. 

1. Introduction. Population genetics is dealing with biological diver- 
sity of species. Concepts developed in this area include models for genetic 
drift, mutation, selection, recombination and population structure. These 
models are applied frequently to eukaryotic species to analyze their evolu- 
tionary history. For prokaryotes these concepts are applied less frequently 
and Maynard-Smith (1995) even asked: "Do bacteria have population genet- 
ics?" 

Usually (by the biological species concept), a species is a reproductively 
isolated set of individuals. This definition can hardly be applied to prokary- 
otes, i.e. bacteria and archea. In microbiology, researchers have developed 
other approaches, mostly defining a species via genomic similarity. This sim- 
ilarity is either based on hybridization of DNA, or on DNA sequences of 
specific molecules such as ribosomal RNA or suitable housekeeping genes, 
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known to be highly conserved (and which are identified by genome sequenc- 
ing or by a technique called multilocus sequence typing; Maiden et al. 1998). 
The definition of bacterial species is complicated by the fact that the sim- 
ilarity of bacteria depends on the considered genomic region, which can be 
explained by transfer of genetic material between these bacteria (Dykhuizen 
and Green, 1991). Even more extreme, individuals from the same species 
carry different genes. For example, one quarter of the genome of a pathogenic 
variant of E. coli was found to be absent in a laboratory strain (Perna et al., 
2001). Such findings lead to several new hypotheses: The core genome hy- 
pothesis argues that the set of genes common to all bacteria of a species is 
responsible for maintaining species-specific phenotypic properties (e.g. Riley 
and Lizotte-Waniewski, 2009). The distributed genome hypothesis predicts 
that no single individual comprises the full set of genes of the bacterial 
population (e.g. Ehrlich et al., 2005). 

The distributed genome hypothesis is similar to the idea of bacterial 
pangenomes. Taking the different gene content of individuals of a popu- 
lation into account, the pangenome consists of all different genes carried by 
all individuals. The pangenome can be split into the core genome, i.e. the set 
of genes carried by every individual of the population, and the dispensable 
(also: auxiliary or flexible or contingency) genome (Medini et al., 2005). 
The pangenome was first analyzed for pathogenic strains of Streptococcus 
agalactiae (Tettelin et al., 2005). It was shown that around 80% of a single 
genome (i.e. the genome of a single individual) forms the core genome. How- 
ever, each fully sequenced genome carries genes which do not occur in other 
strains, suggesting that the core genome represents only a small fraction 
of the pangenome. The situation is even more extreme in Prochlorococcus 
and Synechococcus, which are marine cyanobacteria, where the core genome 
consists of around 60% of the genes found in a single genome (Kettler et al., 
2007; Dufresne et al., 2008). In contrast, Medini et al. (2005) show that a set 
of four genomes of Bacillus anthracis contain all genes found in the complete 
sample of 8 individuals, showing that the core genome is the biggest part 
of the pangenome of this species. Recently, the pangenome of all bacteria 
was considered, using a dataset of 573 completely sequenced genomes show- 
ing that only 250 genes (which are 8% of a bacterial genome on average) 
were common to almost all bacterial species (Lapierre and Gogarten, 2009; 
Bentley, 2009). 

The bacterial supragenome makes the split into the core and dispensable 
genome more precise: each gene present in a population (or in a sample) 
has a frequency (for core genes this is 100%) such that the pangenome 
gives rise to a gene frequency spectrum. The first analysis of Hogg et al. 
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(2007) on a sample of 13 genomes of Haemophilus influenzae shows that 
the largest class (19% of the pangenome) in the dispensable genome are 
genes only present in a single genome. In addition, every pair of genomes 
differs by around 300 genes on average. Similar findings were obtained for 
Streptococcus pneumoniae (Hiller et al., 2007). 

The pan- and supragenome suggest that genes can be gained and lost 
along lineages of bacteria, leading to diversity of genomes. It is well-known 
that genes can be gained in bacteria by three different mechanisms: (i) The 
uptake of genetic material from the environment is referred to as trans- 
formation, (ii) Bacteria can be infected by lysogenic phages which provide 
additional genetic material that can be built in the bacterial genome. This 
process is known as transduction, (iii) A direct link between two bacterial 
cells of the same species leads to exchange of genetic material, known as con- 
jugation. These three mechanisms are usually referred to as horizontal gene 
flow. Events of gene loss occur by mutations resulting in pseudogenization 
or deletion of genes. 

The aim of the present paper is to model the bacterial pangenome. We fo- 
cus on two different aspects: the genealogical relationships between individ- 
uals and the mutational mechanism. Using the diffusion limit of a standard 
neutral model (with finite offspring variance) leads to a random genealogy, 
usually referred to as the Kingman coalescent (Kingman, 1982; Wakeley, 
2008). Gene gain and loss is the basis of our mutational model, as intro- 
duced by Huson and Steel (2004) in the phylogenetics literature. Here, new 
genes are taken up from the environment at constant rate along ancestral 
lines. We assume that all genes taken up are different. In addition, present 
genes can be lost at constant rate. In analogy to standard population genetic 
models we refer to this as the infinitely many genes model. 

2. Model. The dynamics of our model consist of two parts. Reproduc- 
tion follows the (diffusion limit of a) neutral Wright-Fisher model (or some 
other exchangeable population genetic model with finite offspring variance). 
The mutation model we use is borrowed from the phylogenetics literature 
(Huson and Steel, 2004) and describes gene gains and losses along ances- 
tral lines. After introducing the model in Sections 2.1 and 2.2, we discuss 
connections to other mutation models in Section 2.3. 

2.1. Reproduction dynamics. We will use the neutral Wright-Fisher model: 
a panmictic population of size N reproduces neutrally and clonally, i.e. asex- 
ually. In this model, individuals in generation t + 1 choose a unique parent 
from generation t purely at random and independent of other individuals. 
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It is well-known that the genealogy of a sample of size n taken from the 
Wright-Fisher model converges for N —> oo after a time rescaling by N to 
the Kingman coalescent started with n lines (e.g. Durrett, 2008; Wakeley, 
2008). In this process, starting with n lines: 

• if there are k lines left, draw an exponential time with rate (2) (which 
equals the number of pairs in the k lines), which is the time to the 
next coalescence event, 

• at the next coalescence event pick two lines at random from the k lines 
and merge these into one line. 

If there is one line left, the sample has found its most recent common ancestor 
which we trace back into the past for an infinite amount of time. 

Definition 2.1 (Kingman coalescent). We denote the random tree re- 
sulting from the above mechanism - the Kingman coalescent - by T. We 
consider T as a partially ordered metric space with order relation ■< and 
metric dr where the distance of two points in T is given by the sum of the 
times to their most recent common ancestor. We make the convention that 
s -<t for s,t G T if s is an ancestor of t. 

We note that our starting point, the Wright- Fisher model can be replaced 
by other models. For continuous, overlapping generations, the Moran model 
is the most canonical choice. Generally, every exchangeable model with ge- 
nealogy - under a suitable time rescaling - given by the Kingman coalescent 
leads to the same results as those obtained in the present paper; the ge- 
nealogy of a sequence of exchangeable models converges to the Kingman 
coalescent if and only if the sequence of offspring distributions of a single 
individual has bounded finite variance and fulfills a condition regarding their 
third moments (Mohle and Sagitov, 2001). 

2.2. Mutation dynamics. We model individuals whose genomes consist 
of sets of genes. Every individual has a set of genes Q c , g c := \Q C \ which 
are absolutely necessary to survive and hence are conserved, i.e. must be 
passed from ancestor to offspring. The genes Q c constitute the core genome. 
In addition, we model an infinite gene pool by a set of genes I = [0, 1] with 
Q c n I = 0. The genome of individual % in the sample, 1 < % < N, contains 
genes Qi C I which are not necessary for the individuals to survive. This set 
of genes is called the dispensable genome of individual i. 

During the lifetime of every individuals or at every reproduction event, 
mutations may happen. In our mutation model, the infinitely many genes 
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model, we assume the following two mechanisms (in a Wright-Fisher pop- 
ulation of size N) which changes the dispensable genome from parent to 
offspring: 

• gene gain: before reproduction of individual i, there is a probability \i 
that a new gene u £ / is taken up from the environment. We assume 
that gene u has never been present in any genome of the population. 

• gene loss: every gene of the dispensable genome u S ft of individual i 
is lost with probability v before reproduction of individual i. 

We take an extreme point of view here in that the core genes are abso- 
lutely necessary for an individual to survive and the genes in the dispensable 
genome evolve completely neutral. Using this view, ancestry is not affected 
by mutations, i.e. all gene gains and losses seen in the population are as- 
sumed to be neutral. 

Using the same time-scaling as for the genealogies, we assume that n = hn 
and v = vpf are such that 9 = lirri7v-»oo 2Nfi^ and p = limjv^oo 2Nv^. After 
this rescaling of the parameters, new genes are gained at rate | and present 
genes are lost at rate ^. 

Definition 2.2 (Tree-indexed Markov chain for gene gain and loss). Let 
T be the Kingman coalescent. We either assume that T is rooted at the most 
recent common ancestor of the sample or that T has a single infinite line. 
Given T, we define a Markov chain Yj- = (Gt)teT, indexed by T, with state 
space ftff(L), the space of counting measures on I. (The Markov property for 
the tree-indexed Markov chain Yj- states that for all t £ T, (G s )t<s depends 
on {Q s ) s <t only through Q t .) Denoting by Xj the Lebesgue measure on I, Yj- 
makes transitions 



along T. Taking into account that the tree T has n leaves, one for each 
individual of the sample, we denote these leaves by l,...,n G T. In this 
setting, Q±, ...,Q n describe the genes present in individuals 1, ...,n. 

An illustration of the tree-indexed Markov chain is shown in Figure 1. 

Remark 2.3 (Notation). Note that all gained genes are almost surely 
different, so Qt does not have double points, i.e. Gt({u}) £ {0,1}, for all 
u £ / and t £ T, almost surely. We will use the following notation, equating 
counting measures without double points with their support: consider g £ 
Nf(I) without double points. There is m £ N and ui,...,u m with g = 



(2.1) 



from g to g + 6 U at rate |A/(dn) 
from g to g — S u at rate ^g(du) 




AUi'. Gain of gene u 
Jut: Loss of gene U{ 



Fig 1 . An illustration of the infinitely many genes model along a Kingman coalescent. If 
a gene is gained along a line (indicated by the L-sign) it can be lost again (indicated by a 
T-sign). An individual of the sample (i.e. a leaf of the coalescent tree) carries the set of 
genes which were gained along its ancestral lines and did not get lost again. Here are some 
examples: The gene U2 is present in all individuals, us is only present in 10 individuals of 
the left branch due to two gene losses. The genes 114 and U5 were lost in all ancestral lines 
and do not occur in any individual. The gene u-j is missing in the 2 individuals on the left 
side due to a gene loss and in the right branch as the gene gain was in the left branch. 



&ih- We will refer to u±, u m as the points in g and also write g = 
{ui, ...,u m }. Moreover, we define 




9s n gt •= g s a gt, g s \gt ■= (g s - gt) 



for g,g s ,gt G -A//C0- 

Our aim is to describe patterns of the dispensable genomes Q := (Qi, Q n ) 
or whole genomes (QiUQ c , ■■.,Q n ^Qc)- These results can then be compared to 
genomic data of a sample of bacteria which gives the genes (or gene families) 
carried by individuals in the sample. 

2.3. Comparison to other mutation models. In mathematical population 
genetics, there are several standard mutation models, e.g. the infinitely many 
alleles model and the infinitely many sites model] see e.g. Durrett (2008) or 
Ewens (2004). In the former, every mutation (along some random tree) leads 
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to a new type, also called a new allele. It is assumed that mutated alleles 
have never been present in the population before. The latter is a refinement 
of the former: the allele of an individual in the sample is modeled as an 
infinite stretch on DNA. Every mutation is assumed to change a single site 
on this genome (hence leading to a new allele) and it is assumed that every 
mutation hits a site that has never been hit before. The last assumption is 
relaxed in the finite sites model where sites can be hit several times changing 
their state between several possibilities. 

The infinitely many genes model as described above is conceptually dif- 
ferent from these standard models: the infinitely many sites model (along 
some random tree T) can be described, when a genome is given as the linear 
set I as above using that events, occurring at rate |A(du) along the tree, 
changes the state from the ancestral to a derived state at position u in the 
genome. However, loss events do not have a correspondence in the infinitely 
many sites model. In the finite sites model, a site can change from the an- 
cestral state to a derived state and back; however, in the infinitely many 
genes model, once a gene has changed from the ancestral state (not present) 
to the derived state (present) and back (not present) no further changes of 
the state are possible. 

Although all mutation models are conceptually different, the infinite sites 
model can be seen as the infinitely many genes model for p = 0. To under- 
stand this, consider a random tree with gene gain events (and no losses due 
to p = 0). Reinterpreting these gene gains as point mutations along a chro- 
mosome, each mutation hitting a new site, leads to the infinitely many sites 
model. However, there are still differences between the infinitely many genes 
model for p = and the infinitely many sites model. On the one hand, p = 
implies that genes cannot get lost which leads to an infinite genome for all 
individuals. On the other hand, most interesting quantities concentrate on 
mutations segregating (i.e. showing both the ancestral and the mutated - or 
derived - state) in the sample. In summary, the infinitely many genes model 
for p = and the infinitely many sites model are the same with respect 
to properties of segregating genes/sites. However, we will see in our results 
below (Theorems 2, 4 and 5) that the infinitely many genes model is not 
continuous in p = for certain aspects of segregating mutations. 

3. Results. In the rest of the paper, we fix a sample of size n £ N 
and 6, p > 0. We describe expectations and variances of several quantities of 
interest. If we want to stress the dependence on the model parameters we will 
use subscripts, e.g. E# p[.] in order to make this clear. Since the core genome is 
conserved for all individuals of the population, we focus on the dispensable 
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genome first. We provide results for the average number of genes in the 
sample (Section 3.1, Theorem 1), the average number of pairwise differences 
(Section 3.2, Theorem 2), incongruent pairs of genes (Section 3.3, Theorem 
3), the size of the dispensable genome of the sample (Section 3.4, Theorem 4) 
and the gene frequency spectrum (Section 3.5, Theorem 5). We then extend 
these results to the complete pangenome, i.e. the union of the dispensable 
and core genome (Section 3.6) and describe the application of our model 
to a dataset from Prochlorococcus, a marine cyanobacterium (Section 3.7). 
Finally, we discuss biologically realistic extensions of our model (Section 
3.8). 

3.1. Average number of genes. The simplest statistics in the infinitely 
many genes model is based on counting the number of genes for all individ- 
uals in the sample. The average number of genes (in the dispensable genome) 
is given by 



(3.i) A==£El&l- 

1=1 

Our first results provides the first and second moment for A. 

Theorem 1 (Average number of genes) . For A as above 

E[A] = °, 
P 

I a ft 



nl+p p(l + p) 

Remark 3.1. 1. Note that the result for K[A] is robust against changes 
of the reproduction mechanism or non-equilibrium situations. Consider any 
model of reproduction which has not gone extinct by time t. As long as the 
mutation mechanism is independent of reproduction, picking an individual 
at time t from the population gives a single ancestral line along which genes 
accumulate by the same distribution. In particular, the results for E[A] re- 
mains unaltered under population size changes or population subdivision. 
2. The fact that VL4] does not converge to as n — ► oo does not come as a 
big surprise: the sets Gi,—,Gn are dependent through the joint genealogy - 
given through the Kingman coalescent - relating the sample. 

3.2. Average number of pairwise differences. The average number of pair- 
wise differences is given by 

(3-2) D:=^ J2 \Si\Qi\- 

l<i^j<n 
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Theorem 2 (Average number of pairwise differences) . For D as above, 
9 

t (3 + Up + 23p 2 + 16p 3 + V + 49 + 2p9) 
V (l + p)2(2 + p )(3 + p)(l + 2p)(3 + 2p) 

6 + 19p + 19/) 2 + 12p 3 + 4p 4 + 86* + 4p6> 1 
+ (l + / 9)(2 + p)(3 + p)(l + 2p)(3 + 2p) n 

3 + llp+12/9 2 +4p 3 + 106* + 9p9 + 2p 2 9 2 \ 
+ (l + p)(2 + p)(3 + p)(l + 2p)(3 + 2p) n(n-l))' 

Remark 3.2. 1. The quantity D is only based on genes segregating in 
the sample. Hence, as explained in Section 2.3, the infinitely many genes 
model for p = is equivalent to the infinitely many sites model with respect 
to D. As the theorem shows, the expected number of differences between 
individuals i and j is E[\Gi \ Gj\ + \Gj \ Gi\] = ^j^,- Hence, IE p [D] is not 
continuous in p = since the comparable quantity in the infinite sites model, 
the average number of segregating sites in a sample of size two, is 9. The 
reason is that for small p, every individual carries a lot of genes, all of which 
can get lost at the small rate p. These loss events lead to differences between 
individuals as well as events of gene gain. A similar argument shows that the 
variance is not continuous; see e.g. Wakeley (2008, (4.15)) for the variance 
in the infinite sites model. 

2. Note that V[D] does not converge to as n — > oo. Again - the reason is 
that the differences (Gi \ Gj)i<i^j< n are dependent through the underlying 
common genealogy. 

3.3. Incongruent pairs of genes. Assume the following situation: for a 
pair of genes there are four individuals in which all four possible states 
of presence/absence of the two genes are observed. This means that the 
following situation is found: 





gene 1 gene 2 


Individual 1 
Individual 2 
Individual 3 
Individual 4 


present present 
present absent 
absent present 
absent absent 



If genes cannot be lost (p = 0) this situation cannot occur in our model. The 
reason is that gene 1 would indicate that individuals 1 and 2 have a common 



E[D] = 
Y[D} = 
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Fig 2. If genes can be lost (p > 0) it is possible that all four possible configurations of 
presence/absence of a pair of genes are seen in a sample of four individuals. We call such 
pairs incongruent. The figure shows one possible history of the two genes which leads to 
incongruence. 

ancestor before 1 and 3 have (otherwise individual 3 would also carry gene 
1), while gene 2 indicates that individuals 1 and 3 have a common ancestor 
before 1 and 2 have. This is the reason why we call pairs of genes for which 
the above situation appears incongruent. If p > 0, incongruent pairs can 
arise by gene loss; see Figure 2 for an example. We will now state how many 
incongruent pairs we can expect to see in our sample. 

The average number of incongruent pairs of genes (in four genomes) is 
given by 

n 

P := n(n-l)(n-2)(n-3) X! ' 

i,j,k,l=l 

where 

(3.3) D im -=\(9iUG j )\(G h ugi)\, l<i,j,k,l<n. 
Theorem 3 (Incongruent pairs of genes) . For P as above, 

vm = (?P 18 + 117^+203^ + 105^ 

1 J 4 (l + |)2(i + 2f)(l + 4f)(3 + 4f)(3 + 5f)(6 + 5f)(6 + 7f)" 

Remark 3.3. 1. In the proof of Theorem 3, we have to consider all 
possible genealogies of four individuals. In order to obtain the variance of 
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P, one would have to take into account all possible genealogies relating eight 
individuals. 

2. Note that 

19 4-3 9 
(3 - 4) E[A ^ ] = 6 2 (3 + p)(2 + p) = (3 + p)(2 + P ) 

by the gene frequency spectrum (Theorem 5). Using this results it can be 
shown that COVj-Djj^, Aifcjz] < in all cases. 

3. For p = we find that either m = or D^ji = implying that P = 
(hence Ep = o[P] = 0), almost surely. The theorem implies that K p [Dij^i ■ 

Dik,ji] — — 0, i.e. Ep[P] is continuous in p = 0. This observation is not 
obvious since small p implies that all individuals carry many genes. Using 
this fact, one could argue that the chance to observe a pair of genes giving 
rise to Dij^l ■ Dikji grows with decreasing p. However, although the number 
of genes grows for small p, pairs of genes giving rise to Dij^i • Dikji are most 
probably created by two gene gains and one gene loss, as shown in Figure 
2, for small p. 

4. As we will discuss in Section 3.8, the possibility of horizontal gene transfer 
(by bacterial conjugation) would be a biologically realistic extension of our 
model. Under such a mechanism, new genes are not only taken from the 
environment, but also from other individuals of the population. As a result, 
the genealogical tree would be different for different genes. Hence, the or- 
der of coalescence can be different and so there is an increased number of 
incongruent pairs of genes. Hence, the theorem is valuable for determining 
the possibility of horizontal gene transfer in real populations. 

3.4. Size of the dispensable genome. Now we come to properties involv- 
ing the whole sample (in contrast to pairs and quartets of individuals in the 
last theorems). The simplest statistics involving all individuals of the sample 
is the total number of genes, i.e. the size of the dispensable genome. 

The size of the dispensable genome is given by 

(3.5) G:=|(J&|. 

i=i 

We need the following definition in order to give the variance of the num- 
ber of genes in the dispensable genome in our next theorem. 

Definition 3.4 (The functions h k and g^) . For k > we define 



12 BAUMDICKER, : 

Moreover, for k = {k\,k 2 , k 3 ) with 

Ui = {k\ - l,k 2 ,k 3 ), 
fc' 3 = (h,k 2 ,k 3 ~ !)' 
A/ 5 = (fci-l,fc 2 + l,fc 3 ), 



SSS, PFAFFELHUBER 
i, k 2 , k 3 > 0, we set 

k[ 2 = (ki,k 2 - l,k 3 ), 

fc4 = (fcl + l,fc2-l,fc 3 -l), 
fc6 = (fcl-l,fc2,fc 3 + l), 



Ai = , A 2 = j + hk 2 + § k 2 , A 3 = (j j + hk 3 + f fc 3 , 

A 4 = fc 2 A:3, A 5 = A 6 = %ki 

and A = X)E=i ^ e define recursively 

(3.7) 



5fc 



jh kl+k2 , ifk 1 + k 3 = l, 
2 -h kl+k3 , ifk 1 + k 2 = l, 



and 



(3.8) 



5fc = (fcl + fc 2 )(/ci + fc 3 )|r 

A 

A. 



(3-9) + £ y (= ((fci + Wfci-H^ + (*i + fe 3 )/i fc 



m all other cases. 



Theorem 4 (Size of the dispensable genome) . For G as above, 

n-1 

(3.10) E[G] = 6J2 



n—l 



n P + * 

Ira addition, with g(k 1 ,k 2 ,k 3 ) given in Definition 3.4, 

(3.11) V[ G ] = /f ^-^(g^) 2 + ^ ( „,„, 0) . 

2=0 ' 2=0 ' 

Remark 3.5. 1. An estimate for the size of the pangenome (dispensable 
plus core genome) in real bacterial populations has attained much interest 
(e.g. Tettelin et al., 2005; Lapierre and Gogarten, 2009). Most interestingly, 
some species like Bacillus anthracis seem to have a closed genome, i.e. only 
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a limited number of genes in the pangenome, since no new gene was found 
after sequencing the fourth out of eight strains (Tettelin et al., 2005). Other 
species like Prochlorococcus and Synechococcus, have an open genome since 
estimates based on 22 different strains show that every newly sequenced 
genome exhibits 277 new genes on average (Baumdicker, unpublished ob- 
servation). For open genomes, a model based on some linguistic insights, 
Heap's law, has been considered (Tettelin et al., 2008). As a result, a power 
law for the total number of genes is found and it is estimated that a to- 
tal of n °- 43 ±°- 02 genes are found in a sample of n individuals from Bacillus 
cereus. This finding is in stark contrast to our theorem, which implies that 
the number of genes grows only logarithmically in n. However, in order to 
decide which is the correct asymptotics certainly requires a lot more data, 
since n = 14 strains are not enough to specify asymptotic behavior. 

2. We conjecture that Y[G] grows like E[G] for n — > oo. (The corresponding 
statement is true in the infinite sites model; see (Wakeley, 2008, (4.8)).) 
The reason is that for given T, G is Poisson distributed with a parameter 
increasing with the tree length. In addition, for large n the length of the 
Kingman coalescent is largest near the leaves and the coalescent almost 
becomes deterministic near the leaves. E.g. it has been shown that the sum 
of external branch lengths (i.e. branches connecting a leaf to the next node 
in the tree) converges to 2 in I? (Fu, 1995). 

3. We give an example for the computation of gk in the case k = (2,0,0). 
For the calculation, we observe that Ai = 1, A2 = A3 = A4 = 0, A5 = A6 = p, 
A = 1 + 2p and, from (3.6) and (3.7), 



2 

J 

P 
4 



2 2 _2(l + 2p) 



2 P 1 + P p(l + P) ' 



9(i,o,o) 



P- 



,2 ' 



5(i,i,o)= 9(i,o,i) = -{- + —^ 



4/1 1 



) 



4(1 + 2p) 
P 2 (l + P)' 
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The recursion (3.8) then gives 

_ 8 l/4fei \ 

5(2,0,0) - (1 + 2p)2 + rT2plrT2p + ff(1AO v + 



n%(rT2p (2/ll+2/l2)+ ^) 



(3.12) 



J 2 2 



(l + 2p) 2 p(l + 2p) 2 p 2 (l + 2p) (l + 2p) 2 

2 2 
+ (l + 2p)(l + p) + p{l + p) 



+ 



p 2 p(l + p) (l + 2p)(l + p)- 



= < + 0T7)) 



2 

+ 



,p (l + p)V (l + p) 2 (l + 2p)- 
Using (3.11) this then gives for n = 2, 



p(l + p) (i + p)2(i + 2 p)- 
For n = 3, the computation is more involved 1 and leads to 

, 2 90 + 249p + 275p 2 + 145p 3 + 30p 4 
V n=3 [Gj - - + — — + — — — + 



p 1 + p 2 + p (l + p) 2 (2 + p) 2 (l + 2p)(3 + 2p)(6 + 5p)' 

3.5. Gene frequency spectrum 2 . By definition, core genes are present 
in all individuals of the sample. In contrast, genes from the dispensable 
genome can be present at any frequency. These possibilities give rise to the 
gene frequency spectrum. 

The gene frequency spectrum (of the dispensable genome) is given by 
G±, G n , where 

(3.13) G k n) := G k := \{u G I : u e ft for exactly k different i}. 

Theorem 5 (Gene frequency spectrum). For G±, G n as above, 
n---(n-k+l) 



Several computations in the paper are most easily done using a program like Math- 
ematica. Therefore, a MATHEMATlCA-notebook with all relevant computations can be 
downloaded from the homepage of the corresponding author. 

2 The term gene frequency spectrum was used by Kimura (1964) to denote the fre- 
quency of alleles in the infinite sites model. Later, the term changed to site frequency 
spectrum since single sites on the chromosome could be sequenced (Durrett, 2008). Here, 
we reintroduce the term for gene frequencies in the infinitely many genes model. 
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Remark 3.6. 1. In the case p = 0, genes cannot get lost and conse- 
quently G n = oo. However, the classes k = l,...,n — 1 consist of genes 
segregating in the sample (since both states - presence and absence of the 
gene - are observed). Hence, as discussed in Section 2.3, these classes follow 
predictions for the infinite sites model. In this model, it is implicit in results 
already obtained by Wright (1938) (and later were refined by Kimura, 1964; 
Griffiths, 2003; Evans et al., 2007) that 



6_ 
k' 



E p=0 [G fe ] 

On the other hand, by the theorem, 

P lo On 



k 



k{n — k) 



such that the gene frequency spectrum is not continuous at p = 0. 
2. The model for the bacterial supragenome, introduced in Tettelin et al. 
(2005) takes population frequencies of genes into account, i.e. the gene fre- 
quency spectrum. While the supragenome model assumes several different 
frequency classes to begin with, we derive the gene frequency spectrum from 
first principles, i.e. from gene gain and loss events along the genealogy. 

3.6. Union of core and dispensable genome. Until now we only derived 
results for the dispensable genome. In data obtained from bacterial species, 
the union of the core and dispensable genome is of primary interest. It is 
straightforward to extend our results to this union: 

If we replace Qi by Qi U Q c , 1 < i < n, in (3.1), (3.2), (3.3), (3.5) and 
(3.13), recall g c := \Q C \, and denote the resulting quantities by A, D, P, 
D ijM , G, G k , we obtain 

A = A + g c , D = D, D ijM = D ijM , P = P, G = G + g c 

and 



Gk 



Gk, = 1, ...,n - 1 

G k + 9c, k = n. 



Hence, properties of A, D, P, G, G k follow immediately from Theorems 1 - 
5. 
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3.7. Application: a dataset from Prochlorococcus. Data from complete 
genomes of a population sample of bacteria have been available only for a 
few years. Because the infinitely many genes model we propose is new in the 
population genetic context, we show some data in order to see if the model 
as studied above could be realistic. 

Here we chose a set of n = 9 strains of Prochlorococcus which appear to 
be closely related. Prochlorococcus is a marine picocyanobacterium (length 
~ 0.6/xm, genome size ~ 2Mbp) living in the ocean at depth up to 200m. 
Their population size can be as large as 10 6 individuals (i.e. cells) per ml. 
In total, 22 complete genome sequences of these cyanobacteria are available 
in GenBank at the moment (Kettler et al., 2007; Dufresne et al., 2008). The 
n = 9 chosen Prochlorococcus genomes are similar to each other in terms of 
GC-content and share a similar physiology. 

We estimate the model parameters 9, p and g c based on the gene frequency 
spectrum G±, Gg which we compare with our results from Theorem 5. The 
number of genes present in all individuals is 1282, forming the largest class 
in the observed gene frequency spectrum (see Figure 3). Genes occurring in 
only a single individual were the second largest class with 1034 genes. By a 
least squares fit of Gk and E[Gfc] for k = 1, ...,n we obtain the estimates 

(3.14) # = 1142.17, p = 2.03, g c = 1270. 

Note that the estimate for g c means that we expect that 14 genes which are 
carried by all individuals belong to the dispensable genome. As shown in 
Figure 3, these estimates produce a reasonably good fit with the data. Of 
course, a statistical test which is able to reject our model for gene content in 
general, and the assumption that all genes in the dispensable genome evolve 
neutrally in particular, would be desirable. 

3.8. Outlook. We introduce the infinitely many genes model on a King- 
man coalescent as a simple null-model of genome evolution in bacterial 
species. However, both the reproduction and the mutation dynamics can be 
extended to become biologically more realistic. For the reproduction dynam- 
ics, several extensions have been considered in the literature, e.g. structured 
populations and populations of varying size; see e.g. Durrett (2008). 

The mutation dynamics can be extended as well. Our strongest assump- 
tion is that genes taken from the environment are completely new. In partic- 
ular, the model does not allow for genes being transferred between individ- 
uals directly. Such a physical exchange of genes between bacteria is known 
as horizontal gene transfer. The underlying mechanism is bacterial conju- 
gation. The donor cell produces a pilus that attaches to the recipient cell 

imsart-aap ver. 2009/08/13 file: 20091016BHP.tex date: October 16, 2009 



3 RESULTS 



17 




123456789 
k 



Fig 3. The fit of observed data from nine closely related strains of Prochlorococcus with 
the expectations for the gene frequency spectrum. Estimates were as given in (3.14). 

and a single strand of DNA is transported from the donor to the recipient. 
After replication of the DNA, both cells carry the transferred genetic mate- 
rial. The duration of conjugation is long enough in order to transfer several 
genes. Hence, by events of horizontal gene transfer, the transferred genes do 
not share the genealogy of the cell line. Thus, building such a mechanism 
into the above model requires the use of different genealogical trees for dif- 
ferent genes. Such a mechanism was already considered in the phylogenetics 
literature by Kunin and Ouzounis (2003). 

In order to add even more biological realism, at least three aspects can 
be considered: 

1. As Lefbure and Stanhope (2007) show there is frequent recombination 
even within the core genome. Such recombination can also be explained 
by conjugation and has attained much interest (e.g. Fraser et al., 2007) 
since the amount of recombination is known to be related to sequence 
similarity (e.g. Vulic et al., 1997), suggesting that bacterial species 
can be distinguished by the extent of recombination between strains 
(Dykhuizen and Green, 1991; Maynard-Smith, 1995). 

2. As seen in genomic data, several genes are clustered in gene families. 
This is best explained by events of gene duplication with a potential 
subfunctionalization of these genes along ancestral lines (e.g. Durrett, 
2008; Durrett and Popovic, 2007). 

3. There are certainly selective constraints on the number of genes in 
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the dispensable genome. If these genes are evolving neutrally and are 
not necessarily needed for a bacterium to function properly, selection 
should act in order to minimize the dispensable genome. 

Specifying the set of genes of an individual requires that the whole genome 
of the individual is sequenced. Finding the different genes in a dataset like 
the one used in the last section means that open reading frames (ORFs), 
i.e. regions in the genome between start codons and stop codons of all indi- 
viduals are found. In the dataset we say that two individuals carry the same 
gene if we find a pair of ORFs in both individuals that are highly similar. 
However, the DNA sequence of this pair of ORFs is usually not identical. Re- 
fined mutation models should extend our approach and describe the genomic 
diversity of the different genes as well as the variation of DNA sequences 
within the genes. 

4. The one- line-equilibrium and proof of Theorem 1. Consider 
a sample of size n and recall the sets of genes Qi,-..,Q n from Section 2.2. All 
results we provide with Theorems 1-5 are dealing with the joint distribution 
of Qi, Q n . We start with properties of one- and two dimensional marginals 
of the total masses of this joint distribution. First, we have to obtain a key 
result for the gene content along a single ancestral line in Section 4.1. The 
first two moments of the one- and two dimensional marginals are obtained 
in Section 4.2 which then lead to a proof of Theorem 1 in Section 4.3. 

4.1. The one-line equilibrium. We start with some arguments that will 
appear frequently in the next sections. For n = 1, the random tree T is only a 
single infinite line. We consider the gene content along a single ancestral line 
T = M_. In this setting, recall the process = (Gt)teR- from Definition 
2.2. Note that, almost surely, Qt does not have double points for all t £ R_. 
Recall our notation from Remark 2.3. 

Definition 4.1 (Poisson random measure and thinning). 1. We de- 
note by VOX{a) the distribution of a Poisson random measure with 
intensity measure a. We will also write VOX (a) for the Poisson dis- 
tribution with parameter a if a £ R+ . 

2. For g £ Mf{I), we denote by TTtZftf(g,p) the distribution of the ran- 
dom measure arising by keeping any point in g with probability p. 

Proposition 4.2 (Distribution of r K _). Let s <t. 

1 . Given Q s = g G Mf (I) , the two random measures Qt n Q s and Qt \ Q s 
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are independent. Their distribution is given by 

g t \g s r,voi( e -(i-e-^ t -^)-x I ). 

2. The distribution VOT(- p \i) is the unique equilibrium for and it is 
reversible. In equilibrium, Gt^G s ,Gt\ g s and g s \ g t are independent 
and their distributions are given by 

g t ng 8 ~POX^e-2^ -a 7 ), 

G,\Gt±Gt\G,~ V01{%1 - e -f ('"')) • Xj). 

Remark 4.3. Recall from Remark 2.3 that we identify GsiGt with the 
set of genes carried at times s and t. Note that Gs^Gt = Gs^Gt represents the 
genes present both at time s and at time t. Moreover, G s \Gt = (G s — Gt) + are 
the genes present at time s but absent at time t, i.e. genes lost during time 
(s, t] . The genes in Gt\G s = (Gt — G s ) + are genes gained during time (s, t\. As 
the proposition shows, all three quantities are independent in equilibrium. 

Proof. 1. First, recall that all new points in G s '-, s < s' < t are pair- 
wise different and different from points in G s , almost surely. During (s,t] 
several points of G s ' are lost. A point in G s ' is n °t lost with probability 
e 2 ( ' . Since all points are lost independently, we find that G s H Gt ~ 
THTAf(g, e~ 2~(*~ s )). Additionally, several new points in G arise during (s, t]. 
Hence, we find that Gt \ G s is independent of both, Gs and Gs^Gt- To obtain 
the distribution of Gt\G s , note that a point in G s ' \ G s '- is lost at rate ^ and 

hence is present in Gt with probability e~2~(*~ s ). Since new points arise at 
rate | during (s,t] and are lost independently, we find that the number of 

points in Gt \ Gs is Poisson distributed with parameter | Jq~ s e~^ t ~ s ^ds' = 

-(1 — e~2(*~ s )). Since these points must be uniformly distributed on /, we 

have that Gt \ G s ~ VOT{Ul - e~2(*~ s )) . A/). So we have shown the first 
assertion. 

2. To see that VOT(y\i) is the unique equilibrium of Tr, note that there can 
be at most one equilibrium since the Markov process is Harris recurrent. 

Moreover, if G s ~ POX^-Aj), then THlN{G s , e~ ^ s) ) =WI(Je^M. 
A/) and so 

Gt ~ ?OI(Je^^ s ) • A 7 ) * VO!(^(l - e"2(*- s )) . A/) = VOT^ ■ A/) 



20 BAUMDICKER, HESS, PFAFFELHUBER 

(where * denotes convolution). For reversibility, we write Gs = {G s C\Gt)^{G s \ 
Qt),St = (Gs n Gt) W (Gt \ G s ) where Gs n Gu G s \ Gt, Gt \ G s are independent, 
such that is in equilibrium at times s and t and 

GsnGt-VOl^e-^ -A/), 

G s \ Gt = Gt \ Gs ~ WJ(f (1 - e-i^)) • A/). 

By this representation, given some continuous functions /i,/2 : / — > M, 
writing x) := / /;cte, i = 1,2, 

E [ e -(/i,e s > . e -</2,0t>] = E [ e -(/i+/ 2 ,G s ng t >] . E [ e -(/i,C s \0*)>] . E[ e -</2,0t\0.>] 
= E[e-< /l+/2 ' esnet >] • E[e~^'°«\ 0, >] • E[e~ </2 ' es \ gi)) ] 
= E[e - ^ 2 '^ • e - ^ 1 '^]. 

Hence, since the joint Laplace transforms Efe - ^ 1 '^ -e - ^ 2 '^] determine the 

joint distribution of (G s ,Gt) uniquely, we find that (G s ,Gt) = (Gt,G s ) and 
reversibility is shown. □ 

4.2. Gene content in individuals and pairs. Next we obtain the first two 
moments of the two-dimensional distribution of (|C7i|, \Gn\)- 

Proposition 4.4 (Distribution of (Gi,Gj))- 1- For i = l,...,n, 

Gi-VOI^-Xj). 



In particular, 
2. For 1 < i ^ j < n, 



n\Si\]=nsi\] = - p 



COY[\Gi\,\Gj[ 



Remark 4.5. In the proof of 2. we use the well-known fact that for 
random variables X, Y, T 

COV[X,y] = COY[E[X\T],E[Y\T\] + E[COVLY, Y\T\] 

with 

COY[X,Y\T] := E[(X - E[X\T])(Y - E[Y\T])\T]. 
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PROOF. 1. Consider the ancestral line of individual i. The process (Gt)t<i 
follows the same dynamics as the process studied in Section 4.1. The 
first claim follows from Proposition 4.2.2, which shows that the unique re- 
versible equilibrium for this dynamics is VOT(^Xi). 

2. We denote the random coalescence time of individuals % and j by T. As 
above, Gi and Gj denote (the finite measures describing) the genes present 
in both individuals. Recall that we have shown in Proposition 4.2.2 that the 
equilibrium of the Markov chain = (Gt)teR- of Section 4.1 is reversible. 
Hence, given T, we have that (Gi,Gj) and (G-2T,Go) have the same distri- 
bution. So we find that Gi H Gj,Gi \ Gj and Gj \ Gi are independent and, by 
Proposition 4.2, 



- pT -Xi), G l \G 3 = 
Moreover, both sets of genes, Gi and Qj, are independent of T. We obtain 



Gi n Gj ~ VOT^ p e- pT • A/), Gi \ Gj = Gi \ Gj ~ POI(J(l - e~ pT ) • A/). 



[\Gi\AGj\] =COV[E[|ft||7l,E[|^||T]] +E[COV[|^|,|^||T]] 

= cov[J, |] + e[cov[|& n 0j| + |& \ ^||, \Gi n + \ Gi\\T}] 

as T ~ Exp(l). □ 

4.3. Proof of Theorem 1. Theorem 1 now follows from Proposition 4.4 
and 

1 n f) 

E[A] = -j2nm} = -, 

lb 1 (J 

1 = 1 r 

-, n n 

= ^(EMIftll+E cav[|&|,|$|]) 

1=1 »,J=1 
10 / In _ 1 

rep V n/p(l + p) nl + p p(l + p) 

□ 

5. Extension of Proposition 4.2 and proof of Theorem 2. The 

one-line equilibrium considered in Proposition 4.2 provides the right setting 
for computing the one- and two-dimensional marginals of Gi, Gn as shown 
in the proof of Proposition 4.4. In Section 5.1 we provide a method to com- 
pute higher order marginals. We will use this method for second (Section 
5.2), third (Section 5.3) and fourth (Section 5.4) order which finally leads 
to a proof of Theorem 2 in Section 5.5. 
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5.1. Extending the one-line equilibrium to a genealogical tree. Before we 
introduce the general method how to obtain all marginals of Q\, Q n we 
have to set the scene. Consider the genealogical tree T relating all n indi- 
viduals and the tree-indexed Markov chain IV = (Gt)teT- I n equilibrium, 
we have seen above that Qt ~ VOX{- p ■ A/) for all t € T. Hence, we now 

consider the case that T is a rooted tree with root r and Q r ~ VOT{^ ■ Xj). 
We need some notation to deal with the genealogical tree T. 

Definition 5.1 (Survival function). Let T be a binary tree with one 
distinguished point r £ T, referred to as the root of T, a finite set of leaves 
C C T and internal vertices V. For s,t £ T we denote by (s,t] the set 
of points which must be visited on any path between s and t. Moreover, 
dr(s,t) is the length of the path between s and t. Define a partial order 
■< on T by saying that s ^ t iff s £ (r,t] (such that r is the minimal 
element). For s,t £ T the point s At is given as the maximal element in 
{q : q ^ s and q ^ t}. For an internal node (i.e. a branch point) t £ T we 
denote by t\ and ti the two directions in T leading to bigger ( with respect to 
<) elements. 

We define the survival function pq- : T — > [0, 1] by 

Pr (t) = 1 forte C, 
(5-1) 9 ^ = P. pr{t ) fort£T\(CUV) 

pr(t) = 1 - (1 - pr(h))(l - PT(t 2 )) forteV 
where for / :T\(£UV)->1 

and t + e is any point in T with dj-(t, t + e) = e and t <t + e, if the limit 
exists. 

Proposition 5.2 (Probability of no loss along T). Let T be a binary 
tree, rooted at r, pq- as in Definition 5.1 and Fj- = {Qt)teT be the tree- 
indexed Markov chain from Definition 2.2 with Q r ~ VOT(^Xi). Then for 
u £ L and t £ T 



e (J G s \u£Qt =pr(t). 



sec 



Proof. Denote the probability on the left hand side by q(t). First note 
that q(t) = 1 if t G C since {s £ C : t < s} = {t}. Moreover, the probability 
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on the left hand side decreases exponentially at rate | along branches of T 
due to loss events of u. Last, consider the case t G V. Then, u must not be lost 
to either t\ or ti- This occurs with probability q(t) = l — (l — q(ti))(l—q(t2))- 
In other words, the function q fulfills all defining properties oip*r from (5.1) 
and we are done. □ 

We need some more notation for subsets of a finite binary rooted tree T. 

Definition 5.3 (Length and subtrees of T). We use the notation of 
Definition 5.1. 

1. For the binary tree T we denote by l(T) its total length, i.e. the sum 
of lengths of all its branches. 

2. Let £',M' C £ be sets of leaves with £' f~l M' = 0. We set 
r o := AteC * an d denote by T°(C) the minimal connected, binary tree 
spanning the leaves £, rooted at r . The set (T°(C U M') \ T°(C) 
consists of k < \M'\ different connected subtrees, connected with 
T°(£') at vertices r\,...,r^. We denote the resulting binary trees by 
T 1 (£ / , M'), ...,T k (C ,M'), rooted at ri,...,rk, respectively. 

Remark 5.4. 1. For an illustration of the objects introduced in Defini- 
tion 5.3 see Figure 4. 

2. If \£'\ = 1, it is important to note that T°(C) only consists of a single 
point. Consequently, £(T°(C)) = in this case. 



Proposition 5.5 (Distribution of f| Qs \ U Qt)- Let T be a finite binary 
tree, rooted at r G T, C its finite set of leaves and £',A4' C C with C'nM' = 
0. Moreover, let T°(C'),T l (£' ,M'), ...,T k (£',M') be as in Definition 5.3. 
Let Tq- = (Gt)teT be the tree-indexed Markov chain from Definition 2.2 with 
Qr^VOZ^. Then, 

f]0t\ U ^-Wj(^-^ ^))n(l-^ (w) (r,))-A,). 
tea teM' p i=i 

In addition, ifC",M" C £ with £" n M" = 0, then 

n Qt \ u & and n ^ \ u & 

taC teM' teC" teM" 

are independent if £' n M" / or £" n M' + 0. 
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r 1 



M' 

Fig 4. Illustration of concepts given in Definition 5.3.2. The subtree T°{£) is spanned by 
leaves in C! . Considering T \ T°{C'), the tree falls in three parts. Two of them which lead 
to a leaf in M! , are denoted T k (C' , M'), k = 1, 2. Roots of the trees are r° , r 1 ,r 2 . 

Remark 5.6. The pairwise independence in the proposition can be 
extended to independence of any number of random measures f)ted Gt \ 
UteMi Gt, i = 1, n, provided £ n Mj / or Cj n Mi / holds for any 
pair i^j. 

Proof. Given T, rooted at r G T, we have assumed that Q T = VOT(^ ■ 
A/), i.e. the tree-indexed Markov chain is in equilibrium. Consequently, Q ro ~ 
Q r . Every gene in HteC'Gt \ UteM'Gt must have been present in Q ro . In 
addition, every gene in Q ro has the same chance p to be present in n<e£' Gt \ 
UteM' Gt- This already shows that f)teC' Gt \ UteM' Gt ls a thinning of a 
Poisson measure and hence is Poisson with intensity - p p. It is important to 
note that a gene present in f] teC / Gt \ UteM' Gt must not be lost on the whole 
subtree T°(£), which occurs with probability e - 2^ ^ )) and must be lost 
on any subtree leading to a leaf in M' . However, the chance that a gene 
is lost along one such subtree is given through the survival function. In the 
subtree i, we have a root r« connecting the subtree to the tree spanned by £ 
and so 1 — PT i (a,M')^ ri ) 15 * ne probability that the gene is lost in all leaves 
inM'. 

For the independence property assume that £ n M" / or £' n M' / 0. 

Observe that ( f] teC , Q t \ UteM' Gt) n ( f)teC" Gt \ UteM" Gt) = in this case, 

i- e - C\teC Gt \ UteM' Gt and f)teC" Gt \ UteM" Gt arise by different Poisson 
events along T. The independence follows. □ 
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COROLLARY 5.7. For the same situation as in Proposition 5.5, if £ fl 

M' = C" n M" = 0, 

n u st\, \ n st\ u ftiM 

te£' teAf' te£" teA!" 

=v[i n &\ u &i 

tecuc teM'uM" 

Proof. We write 



T 



tec teM' tec teM'uM"uc 

y 



n^\ u ft=(rift\ u st) 

tec teM'uM"uC 

( n ft\ u n a\ n a). 

te£'u£" teAl'uA!" te£'u£"uA!" teM' 

n ft\ u &=( n a\ u 

tec teM'uM"uc 

( n ft\ u ft) w ( n ft\ u s^ 



tec teM" tec teM'uM"uC 

y 



te£'u£" teAf'uA!" te£'u£"uA!' teAf" 

By the independence statement in Proposition 5.5, only the covariances of 
the two second terms in both equalities do not vanish. The result follows. □ 

5.2. Gene content for two individuals. The simplest case in Proposition 
5.5 arises if T has only two leaves. This case was already studied in the proof 
of Proposition 4.4. We extend our analysis by the next result. 

Proposition 5.8 (Gene content for two individuals). For 1 < i ^ j < 

n, 

E i\Si \ Gj\] = Y~F~ p i 

V " ft \^ = (l + X + 2p) + I^' 

COVI|ft\g J |,|g J \ftl]= (1 + p) ^ 1 + 2p) - 

Proof. We use Proposition 5.5. It suffices to assume that T is a tree 
connecting individuals i and j, i.e. C = First we assume that the 

coalescence time T of the two individuals is given. Under this assumption, 
Proposition 5.5 tells us that 

ft\0 j ~7>0J(5(l-e-'* r ).A J ) 
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and, using the fact that expectation and variance are equal for a Poisson 
distributed random variable, 

e[|£\^||t] =v[\g i \g j \\T] = e p (i-e-<> T ), 

such that we obtain 

E[|ft \ Gj\] = E[E[|ft \ G 3 \\T] = E[f (1 - e~" T )] = ~ p ^~ p = ^ 

V[\Gi\Gj\] = Y[n\Gi\Gj\\T]] +K\y[\Si\Gj\\T\] 
= Vg(l-e-^)]+Eg(l- e -^)] 

_ e 2 / 1 _ _j n o_t i_x 

~p 2 Vl + 2p (l + p)2j + /9 l i + p J 

_ a 2 e 

~ (1 + p )2(i + 2p ) + l + p - 

In addition, given T, C/j \ Gj and <5j \ Gi are independent by Corollary 5.7. 
Hence, 

COV[\Gi \ Gj\, \Gj \ Gi\] = COVpE[|ft \ ^||T],E[|^ \ Gi\\T}] 

■liii-^]-^ 

□ 

5.3. Gene content for three individuals. Similar to Proposition 5.8, we 
use the general setting of Proposition 5.5 in order to prove results about the 
joint distribution of gene content in three individuals. 

Proposition 5.9 (Gene content for three individuals). For i,j,k € 
{l,...,n} pairwise different 

(5.2) COY[\Gi\Gj\,\Gi\Gk\} = n ^ , 2n VQ-LO , + 



(5.3) COV[|&\^U&\&l] 

(5.4) COY[\Gi\Gj\,\Gj\Gk\] 



(l + p) 2 (l + 2p)(3 + 2p) 2 + p 

tf 2 

(1 + p)2(l + 2p)(3 + 2p) 

fl 2 

(1 + p)2(i + 2 p)(3 + 2p) 

fl2 



(5.5) covOft\ftl, ICA^Il = (1 + p) , (1 + 2rt(3 + 2rt + (1 + p)(2 + ri 
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(A) l 


3 k 


(B) t 


k 3 


(C) j 


k i 




Fig 5. The 3 cases for a genealogical tree connecting three individuals i,j, k. 



Proof. We use Proposition 5.5 again. Let T be the tree connecting three 
individuals i,j and k, i.e. C = {i,j,k}. Assume the random times T2,T% 
during which the coalescent has 2,3 lines, respectively, and one of the three 
possible tree topologies, illustrated in Figure 5 are given. We use 
(5.6) 

COV[|&\0jl,l&\&l] = COV[E[|&\^l|nE[|&\&l|7l] 

+ E[coY[\g i \g j \,\g i \g k \\T]] 

and similar equalities for the other cases. We compute both parts of the right 
hand side separately. For the first part we need to calculate E[|^\^||T] 
depending on T: 

1. T G {(A)} : 

2T 3 

E[|ft\$lin = / e 2 e-^dt= e (l-e-^) 
o 

2. T £ {(B), (C)} : 

2T 2 +2T 3 

n\gi\gj\\T] = J °-e-i*dt = -{l - e-^+^a)) 
^ 

Replacing the pair ij in the last to expressions by ik,jk,ki or kj leads to 
the same possibilities arising in the genealogies (A), (B), (C). We collect all 
possibilities in Table 1. 

In Proposition 5.8 we have seen that E[E[|(?j\C7j| |T] = E[|^i\^-|] = and 
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(A) 


(B) 


(C) 


ij 


1. 


2. 


2. 


ji 


1. 


2. 


2. 


ik 


2. 


1. 


2. 


jk 


2. 


2. 


1. 


ki 


2. 


1. 


2. 


kj 


2. 


2. 


1. 



Table 1 

The three different tree topologies from Figure 5 give rise to two different terms for the 
conditional expectation of a pair of leaves, depending on the labeling of the pair. 



therefore, 



COV[E[|&\0il|nE[|&\&l|T]] 

= §gE[(l - e-^3))(i _ ^3)] + i g E[(1 _ e -p(T 2+ T 3))2] _ 



? 2 / 2p 6 6 

+ 



3p 2 V 3 + p (l + p)( 3 + p) (l + p)(3 + 2p) 

6 1 

+ 1 - v-, ^ r + 



(l + p)(3 + p) (l + 2p)(3 + 2p)V (l + p )5 
^ 2 



" (l + p)2(i + 2 p)(3 + 2p)- 

Note that this equation also holds for the other three cases in Proposition 
5.9, i.e. we have computed the first term in (5.6) for all combinations of 
k arising in the proposition. 
Let us now consider the second part of (5.6). From Corollary 5.7 we see 
that 

E[COV[|&\0il,|&\&l|7-]] =E[cov[\g i \g j \,\G j \G k \\T\] =0 

which already gives the assertions (5.3) and (5.4). Moreover, Corollary 5.7 
gives 

E[COV[|&\^USASfcll^1] =®[V[\Gi\(GjUGk)\\T\], 

ElcovWGAGjUGkXGjWT]] =E[v[\g i ng k \g j \\T\\. 

From Proposition 5.5 we know that for given T, |^i\(^U^fe)| and \Qir\Qk\Gj\ 
are Poisson distributed. Note that \Gi\(Gj U Gk)\ is the number of genes 

(n) 

present in i, but not in j and k. Recalling that G k denotes the number of 
genes present in k out of n individuals, it is clear that E[\Gi\(Gj U Gk)\] = 
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|E[G? } ], and so, using Theorem 5 

EW\Gi\(Gj u g k )\\r\] = E[E[\g t \(Gj u &)||r]] = e[|&\(£ u &)|] = 

Equivalent^, with E[|ft n </k\Sj|] = |E[Gj, 3) ], 

E[V[|ft n a fe \^||T]] = E[|ft n g fc \^'l] = {2 + p ° {1 + p y 

□ 

5.4. Gene content for pairs of two individuals. 

Proposition 5.10 (Gene content for pairs of two individuals). For 
i,j,k,l £ {l,...,n} pairwise different 

9 29^ 

comw.w>>w = (3+p)(2+rt + (1+P)2(3+ rt( 1+ 2rt(3 + 2rt 

PROOF. The proof is similar to the proof of Proposition 5.9. Analogously 
to (5.6) we use 
(5.7) 

COV[|ft\0;l,l&\ft|] 

= coY[E[\g l \g :i \\T},E[\Ok\Gi\\r}] + E[cov[\g i \g j \,\g k \g l \\T\] 

As EpE[|&\^||T]] = E[|&\0j|] = ITp we S et that 
COV[E[|ft\^-||T],E[|afc\ft||^]] =E[E[|ft\^||T]-E[|afc\ft||^]' 



(1 + P) 2 ' 

Therefore, four different cases occur depending on the topology of the tree 
seen in Figure 6: 

1. Te{(C),(D),(E),(F),(G),(H),(I),(J)}: 

^E[\g t \g 3 \\T] • E[|&\ftl|7] = (1 - e-^+ T4 ))(l - e -^+^+T 4)) 

2. Te{(4(B),(4(I)}: 

^E[|ft\^||T] • E[|&\ft||?l = (1 - e-" T4 )(l - e P^ +T ^) 

3. T€{(M),(R)}: 

&n\g i \g j \\T\ ■ E[|&\ftl|?l = (i - e" pT4 )(i - e-"( T 3 +T4 )) 
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Fig 6. The 18 equally probable topologies for a genealogical tree connecting four individuals 
i, j, k, I. 



4. Te{(N),(0),(P),(Q)}-- 



o 2 



^m\Gi\Gj\\T\ ■ n\Gk\Gi\\T] = (1 - e-'( T »+ T 3+ T *>) 2 . 



Hence, with a little help from Mathematica, 
(5.8) ' 

COV[E[|ft\^l|^]»E[|g fc \ft||T]] 
_ e 2 , 8 / 18 18 



(3 + p)(6 + p) (l + p)(3 + p)(6 + p) 

18 



(l + p)(3 + 2p)(6 + 2p) 
6 18 18 



6 + p (l + p)(3 + p)(6 + p) (l + p)(3 + p)(6 + 2p) 
6 18 18 



6 + p (3 + p)(6 + p) (3 + p)(6 + 2p) 
36 18 



(l + p)(3 + p)(6 + p) (l + 2p)(3 + 2p)(6 + 2p)^ {l + pf 
29 2 



" (l + p) 2 (3 + p)(l + 2p)(3 + 2p)' 
For the second term, Corollary 5.7 gives 

E[COV[|ft\&l> \Gj\Gi\\T}] = E[Y[D ijM \T]] 
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with Dij t ki as in (3.3). Given T, Aj,fc( is Poisson distributed, hence we obtain 
from (3.4) 

(5.9) E[Y[D ij , kl \T]]=E[E[D ij:M \T]} = 



(3 + p)(2 + p) - 

Combining (5.7) with (5.8) and (5.9) gives the result. □ 

5.5. Proof of Theorem 2. Using Propositions 5.8, 5.9 and 5.10, it is now 
easy to prove Theorem 2. We obtain 

n\n - 1) 2 Y[D] = £ (V[|& \ Gj\] + COV[|ft \ ^|, |£ \ 

+ J] (COV[|ft\^-|,|ft\g fc |] + COV[|ft\^|,|^\a fc |] 

i,j,fc pwd 

+ COV[|ft \ GjUGk \ Gi\] + COV[|ft \ |& \ 0,-1]) 

+ £ cov[|&\^U£A^I] 

i,j,k,l pwd 

= n(n - l)(V[|ft \ G 2 \] + COV[|fi?i \ &|, |& \ Oil]) 

+ n ( n _ i)( n _ 2)(COV[|S 1 \ &|, |0i \ &|] + COV[|fi?i \ G 2 \, \G 2 \ G 3 \] 

+ COV[|& \ &|, |<? 3 \ Gi\] + COV[|fi?i \ G 2 \, \G 3 \ G2W) 
+ n(n - l)(n - 2)(n - 3)CQV[|0i \ £ 2 |, |0 3 \ &|] 

and the result follows by some application of Mathematica. 

□ 

6. Proof of Theorem 3. We denote by T the genealogy connecting 
the individuals i,j,k,l. As above, we note that T is uniquely given by the 
random times T 2 , T3, T4 during which the coalescent has 2, 3, 4 lines, respec- 
tively, and the tree topology, distinguished by 18 equally probably cases, 
illustrated in Figure 6. We use 

(6.1) 

E[D ijM ■ D ikJl ] =E[COY[D ijikl ,D ikJl \T}] + E[E[D ijM \T] ■ E[D ikJl \T]] 
and note that 

COY[D ijM ,D ikdl \T\ = ° 

by Corollary 5.7. So, we are left with computing the second term in (6.1). 
The terms E[Aj,fc/|^"] can take six different values, depending on T. We use 
Proposition 5.5: 
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1. T e {(A), (-B)}: 

= e -f 2T4 (1 _ g -fT 3 + e -2 T3(1 _ e -f (T 3+ T 4))(1 _ g -f (2T 2+ T3 + T 4))) 
= e "2 2T4 - e -2( 2T3+3T4 ) - e -i(2T 2 +2T 3 +3T 4 ) + g -f (2T 2 +3T 3 +4T 4 ) 

2. Te{(G),(D),(G),(/)}: 

fE[Ai,«|r] = e -i( 2T 3+2T 4 ) (1 _ e -fT 4)(1 _ e -£(2T 2+ T 3+ T 4)) 
= e "2( 2T3+2T4 ) - e "2(273+3T 4 ) _ e -^(2T 2 +3T 3 +3T 4 ) + e _^(2T 2 +3T 3 +4T 4 ) 

3. T € {(D), (F), (G), (if), (J), (iV), (O), (P), (Q)}: 

gE[Ai,«|T] = e -i( 2 ^+ 2 ^+2T 4)(1 _ e -fr 4)(1 _ e -i(T 3+ T 4)) 

= e -2( 2T2 + 2T 3+ 2T4 ) _ e -5(2T 2 +27|i+3T 4 ) _ & - % (2T 2 +3T 3 +3T 4 ) 

+ e -^(2T 2 +3T 3 +4T 4 ) 

4. T€{(K),(L)} 

fE[D^ )W |T] = e~2 (2T 2+ 2T 3+ 2T 4)(1 _ ^-f (T 3+ T 4 ) + g -f (T 3+ 2T 4)) 
= e -2( 2,r 2+2T 3 +2T 4 ) _ 2e -2(2T 2 +3T 3 +3T 4 ) + e -^(2T 2 +3T 3 +4T 4 ) 

5. Te{(M)} 

inD ijM \T] = A 2T \\ - 2e - P 2^ +2T ^ + e -i( 2 ^+3T 3+2 r 4)) 
= e -2 2T4 - 2 e ~2^ 2T2+2T3+3T ^ + e -2( 2T 2+ 3T 3+ 4T 4) 

6. Te{(R)}: 

%E[D ijM \T] = e -i( 2 ^3+2T 4 ) (1 _ 2e -£(2T 2+ T 3+ T 4 ) + e _e(2T 2+ T 3+ 2T 4)) 
= e -fi(2T 3 +2T 4 ) _ 2e -^(2T 2 +3T 3 +3T 4 ) + e -£(2T 2 +3T 3 +4T 4 ) 

Relabeling k, I by i, k,j, I changes these cases. Table 2 gives the respon- 
sible terms for EfD^-^^T] and IEfD^ ji\T] for all 18 possible tree topologies. 
We obtain nine cases for which to compute the second term in (6.1). We 

abbreviate e := e 2. 
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(A) 


(B) 


(C) 


(D) 


(E) 


(F) 


(G) 


(H) 


(I) 


ij, kl 


1. 


1. 


2. 


3. 


2. 


3. 


2. 


3. 


2. 


ik,jl 


2. 


3. 


1. 


1. 


3. 


2. 


2. 


3. 


4. 




(J) 


(K) 


(L) 


(M) 


(N) 


(0) 


(P) 


(Q) 


(R) 


ij, kl 


3. 


4. 


4. 


5. 


3. 


3. 


3. 


3. 


6. 


ik,jl 


4. 


2. 


3. 


3. 


5. 


3. 


3. 


6. 


3. 



Table 2 

For et>en/ tree topology of Figure 6, the two pairs ij, kl as well as ik,jl fall into one 
the six cases for the conditional expectation; see below (6.1). 



Te{(A),(C)}: 

^E[D ij>kl \T}-E[D ikJl \T] 

_ ~2T 3 +4T 4 _ ~2T 3 +5T 4 _ ~2T 2 +3T 3 +5T 4 + ~2T 2 +3T 3 +6T 4 

_ g4T 3 +5T 4 + g4T 3 +6T 4 + ~2T 2 +5T 3 +6T 4 _ ~2T 2 +5T 3 +7T 4 

_ ~2T 2 +4T 3 +5T 4 + ~2T 2 +4T 3 +6T 4 + g4T 2 +5T 3 +6T 4 _ g4T 2 +5T 3 +7T 4 

+ ~2T 2 +5T 3 +6T 4 _ ~2T 2 +5T 3 +7T 4 _ g4T 2 +6T 3 +7T 4 + g4T 2 +6T 3 +8T 4 

T€{(B),(D)}: 

£[D ijtkl \T]-E[D ik!jl \T] 

_ ~2T 2 +2T 3 +4T 4 _ ~2T 2 +2T 3 +5T 4 
_ ~2T 2 +4T 3 +5T 4 _j_ ~2T 2 +4T 3 +6T 4 
_ ~4T 2 +4T 3 +5T 4 _j_ ~4T 2 +4T 3 +6T 4 
+ ~4T 2 +5T 3 +6T 4 _ ~4T 2 +5T 3 +7T 4 

T€{(E),(F)}: 

£[D ijM \T\-E[D ikS \T\ 

_ ~2T 2 +4T 3 +4T 4 _ ~2T 2 +4T 3 +5T 4 _ ~2T 2 +5T 3 +5T 4 + ~2T 2 +5T 3 +6T 4 
_ ~2T 2 +4T 3 +5T 4 + ~2T 2 +4T 3 +6T 4 + ~2T 2 +5T 3 +6T 4 _ ~2T 2 +5T 3 +7T 4 
_ g4T 2 +5T 3 +5T 4 + ~4T 2 +5T 3 +6T 4 + ~4T 2 +6T 3 +6T 4 _ ~4T 2 +6T 3 +7T 4 
+ ~4T 2 +5T 3 +6T 4 _ ~4T 2 +5T 3 +7T 4 _ ~4T 2 +6T 3 +7T 4 + ~4T 2 +6T 3 +8T 4 



~2T 2 +3T 3 +5T 4 _|_ ~2T 2 +3T 3 +6T 4 
~2T 2 +5T 3 +6T 4 _ ~2T 2 +5T 3 +7T 4 
~4T 2 +5T 3 +6T 4 _ ~4T 2 +57|i+7T 4 
~4T 2 +6T 3 +7T 4 , ~4T 2 +6T 3 +8T 4 
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TG {(G)}: 

^E[D ljM \T]-E[D ikJl \T] 

= g4T 3 +4T 4 _ g4T 3 +5T 4 _ g2T 2 +5T 3 +5T 4 + ~2T 2 +5T 3 +6T 4 

_ g4T 3 +5T 4 + g4T 3 +6T 4 + ~2T 2 +5T 3 +6T 4 _ ~2T 2 +5T 3 +7T 4 

_ ~2T 2 +5T 3 +5T 4 + ~2T 2 +5T 3 +6T 4 + g4T 2 +6T 3 +6T 4 _ ~4T 2 +6T 3 +7T 4 

+ ~2T 2 +5T 3 +6T 4 _ ~2T 2 +5T 3 +7T 4 _ g4T 2 +6T 3 +7T 4 + ~4T 2 +6T 3 +8T 4 



T€{(fl),(0),(P)}: 

^E^-wIT] -E[Afcji|^] 



_ ~4T 2 +4T 3 +4T 4 _ ~4T 2 +4T 3 +5T 4 _ ~4T 2 +5T 3 +5T 4 + ~4T 2 +5T 3 +6T 4 
_ ~4T 2 +4T 3 +5T 4 + ~4T 2 +4T 3 +6T 4 + ~4T 2 +5T 3 +6T 4 _ ~4T 2 +5T 3 +7T 4 
_ g-4T 2 +5T 3 +5T 4 + ~4T 2 +5T 3 +6T 4 + ~4T 2 +6T 3 +6T 4 _ ~4T 2 +6T 3 +7T 4 
+ g4T 2 +5T 3 +6T 4 _ ~4T 2 +5T 3 +7T 4 _ ~4T 2 +6T 3 +7T 4 + ~4T 2 +6T 3 +8T 4 

Te{(I),(K)}: 

^E[D ijM \T]-E[D iktjl \T] 

= ~2T 2 +4T 3 +4T 4 _ 2e r2T 2 +5T 3 +5T 4 + ~2T 2 +5T 3 +6T 4 
_ ~2T 2 +4T 3 +5T 4 + 2 g2T 2 +5T 3 +6T 4 _ ~2T 2 +5T 3 +7T 4 
_ g4T 2 +5T 3 +5T 4 + 2 g4T 2 +6T 3 +6T 4 _ ~4T 2 +6T 3 +7T 4 
+ g4T 2 +5T 3 +6T 4 _ 2 g4T 2 +6T 3 +7T 4 + ~4T 2 +6T 3 +8T 4 

T€{(J),(L)}: 

^E[D ljM \T]-E[D ikyjl \T] 

_ g4T 2 +4T 3 +4T 4 _ 2g4T 2 +5T 3 +5T 4 + g4T 2 +5T 3 +6T 4 
_ ~4T 2 +4T 3 +5T 4 + 2 g4T 2 +5T 3 +6T 4 _ ~4T 2 +57|i+7T 4 
_ g4T 2 +5T 3 +5T 4 + 2 g4T 2 +6T 3 +6T 4 _ ~4T 2 +67|i+7T 4 
+ g4T 2 +5T 3 +6T 4 _ 2g4T 2 +6T 3 +7T 4 + ~4T 2 +6T 3 +8T 4 

Te{(M),(N)}: 

^E[D ihkl \T]-E[D iktjl \T] 

_ ~2T 2 +2T 3 +4T 4 _ 2 ~4T 2 +4T 3 +5T 4 + g4T 2 +5T 3 +6T 4 
_ ~2T 2 +2T 3 +5T 4 + 2 ~4T 2 +4T 3 +6T 4 _ ~4T 2 +5T;i+7T 4 
_ ~2T 2 +3T 3 +5T 4 + 2 g4T 2 +5T 3 +6T 4 _ ~4T 2 +67;i+7T 4 
+ ~2T 2 +3T 3 +6T 4 _ 2 g472+5T 3 +7T 4 + ~4T 2 +6T 3 +8T 4 
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Te{(Q),(R)}: 

^E[D ijM \T]-E[D ik JT] 

_ ~2T 2 +4T 3 +4T 4 _ 2g4T 2 +5T 3 +5T 4 + g4T 2 +5T 3 +6T 4 
_ g2T 2 +4T 3 +5T 4 + 2g4T2+5T 3 +6r 4 _ ~4T 2 +5T 3 +7T 4 
_ ~2T 2 +5T 3 +5T 4 + 2g4T 2 +6T 3 +6T 4 _ g4T 2 +6T 3 +7T 4 
+ ~2T 2 +5T 3 +6T 4 _ 2g4T 2 +6T 3 +7T 4 + ~4T 2 +6T 3 +8T 4 

Combining the last equations and using Mathematica, we obtain the de- 
sired result. 

□ 

7. Proof of Theorems 4 and 5. Theorems 4 and 5 provide expecta- 
tions for the size of the dispensable genome and the gene frequency spec- 
trum, respectively. Recalling that is the number of genes in frequency 
k = 1, n in a sample of size n, it is clear that 

n 

k=i 

where G is given by (3.5). In addition, if E[G^], i = 1, is known from 
Theorem 5, 

n n -I n- n—1 -i 

Hence, the result for the expected number of genes in Theorem 4 can eas- 
ily be proved once we have established Theorem 5. However, we take an 
alternative route and give an independent proof of Theorem 4. 

7.1. An independent proof of Theorem 4- Recall the survival function p? 
from Definition 5.1. Consider the coalescent, denoted by T, started with n 
lines, rooted at r, the most recent common ancestor of the sample. As shown 
in Proposition 5.2, pq- : T — > [0, 1] gives the probability that a mutation that 
arises at t G T is not lost in at least one leaf. Hence, given T, we find that 
G ~ V01(tt(T)) with 

tt(T) := f / p r (t)cft. 

Next, we consider a random coalescent T with additional loss events at rate 
^ along the tree. We say that t G T is unlost if there is a leaf in i G T such 
that the path [t,i] is not hit by a loss event. Given T, note that 

pr (t) = F[t unlost|T] 
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by Proposition 5.2. 

To prove (3.10), we write immediately, using the above arguments, 



(7.1) 



E[G] 



E| 



pr{t)dt = E % I lit not lost)dt 
r J L Jr 

E [length of unlost lines in T] . 



We split the total length of unlost lines by times in the coalescent 

To compute the expected length of unlost lines we note that all lines are 
unlost near the leaves. The number of unlost lines decreases either by a 
coalescence or by a gene loss event. When there are k unlost lines left, any 
line is lost at rate ^ and two lines coalesce at rate (Jj) ■ Hence, the time until 

there are k — 1 unlost lines is exp {^k + (^ distributed. Thus, 



n 7 n 

k=l \2/ 2 fc=l 



k-l + p 



fe=0 



and we have proven (3.10). 

Next we show how to obtain the recursion for V[G] given in (3.11). Using 
the fact that, given T, the number of genes G is Poisson distributed with 
rate n(T), 

Y[G] = E[E[G 2 |T]] - E[G] 2 = E[vr(T) + 7r(T) 2 ] - E[G] 2 
= E[G] - E[G} 2 + f E[J J Pr (s)pr(t)dsdt\ . 

Since E[G] is known, it remains to compute the last term in the last display. 
Consider two independent Poisson processes V\ and V2 along the tree T, 
each at rate describing gene loss. As above, we say that a point s £ T 
is k- unlost if there is a leaf i G T such that the path [s, i] is not hit by an 
event in Pfe. We denote by the length of fc-unlost points in T, = 1,2. 
Using the same reasoning as in (7.1), 



E 



// 



pr{s)p r {t)dsdt = E[LiL 2 



The latter expectation can be derived via the following construction: in the 
tree T with the two independent Poisson loss processes V\ and V2, denote 
by K\{t) the number of lines which are both 1- and 2- unlost some distance r 
from the treetop, by 1^2 (t) the number of lines which are 2-lost but 1- unlost 
and by K^{t) the number of lines which are 1-lost but 2-unlost by time r. 
Clearly, 



L 



-F 

Jo 



{K^t) + K 2 (r))dr, L 2 



(Ki(r) + K 3 (r))dr. 
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In addition, K = (K(t)) t >o = (Ki(t), K 2 (t), K 3 (t)) t >o is a Markov jump 
process with the following rates from (k±, k 2 , k 3 ) to 



new state 


at rate 


M = ( k l ~ 1,^2, &3) 


Ai = ft) 


ti 2 = {h, k 2 - 1, k 3 ) 


A 2 = ft) + fcifo + §*2 


fc' 3 = (ki, k 2 , h - 1) 


A 3 = ft) + hk 3 + %k 3 


k' 4 = (h + l,k 2 - l,k 3 - 1) 


A 4 = k 2 k 3 


k l 5 = (ki-l,k 2 + l,k 3 ) 


A 5 = f fci 


k' 6 = (ki-l,k 2 ,k 3 + l) 


A 6 = p 2 ki 



Note that k_i, ...,k_Q, Ai, A6 are as defined in Definition 3.4. In each transi- 
tion, 2ki + k 2 + k 3 is not increasing and therefore hits after a finite number 
of transitions. 

To obtain (3.11), define the process K with K(0) = k:= (ki, k 2 , k 3 ) and 
define 

L\= r (Ki(t) + K 2 (r))dT, L\= r '(Ki(t) + K 3 (r))dT 
Jo Jo 

such that Li = L- n '°'°\ i = 1,2. We claim that 

g k _ := E[L*Lf] 

satisfies (3.7) as well as the recursion (3.8). 

First, given k\ + k 2 = 1 {k\ + k 3 = 1), there is only one 1-unlost line 
(2-unlost line) in T. In this case, this line can only be lost by an event in 
~Pi 0^2), independent of all other coalescence events. Hence, in this case, 
h\ and h\ are independent, EjXf] = - (E[L§] = |) and E[L§] = h^+kz 
{¥j[Lj\ = hk 1 +k 2 )- Combining these results gives (3.7). 

Second we show that E[Lf-L§] satisfies (3.8). Since K is a jump process, 
the first event occurs after an exponential time T with rate A, which is 
independent of the new state after the first jump. Conditioning on the first 
event happening at time T, 

L 1 L 2 = J2 X {ncw state is fc^}^ 1 + k ^ T + ^X^ 1 + k ^ T + L 2*)- 

1=1 
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Taking expectations 
completes the proof. 



Taking expectations on both sides shows that E[LfL§] satisfies (3.8). This 

□ 



7.2. Proof of Theorem 5. There are several ways to prove Theorem 5. 
We present here two approaches, one based on diffusion theory, the other 
one using an urn model. 

Proof of Theorem 5 based on diffusion theory. Assume that a 
gene is present at frequency Xq at time 0. Then, (Xt)t>o follows the SDE 



dX = -^Xdt + y/x(l - X)dW. 

Frequency spectra for such diffusions have been obtained by Kimura (1964). 
We follow the arguments given in Durrett (2008), Theorem 7.20. Assume we 
introduce new genes at frequency < 5 < 1 into the population at rate 

9 1 



2<t>{5) 

where /j,(x) := — cr 2 (x) := x(l — x), 

m := exp ( - 2 J o JL?Ldz) = exp (p^ — dz) 

= exp(-plog(l - y)) = (1 - yy p , 
Hx) := J%(y)dy= T ^- p (l-{l-x) 1 -P). 

This rate is consistent in 5: the number of genes at level e > 5 is | 

since ^||y is the probability that the gene reaches frequency e before dying 
out. Moreover, the Green function for the diffusion - measuring the time 
until eventual loss of the gene - is given by 2(f)(S)m(y) for y > 5, where 

m(y) 



is the density of the speed measure of the diffusion. Hence, we find that the 
number of genes in frequency x is Poisson with mean 

q(x)dx := 9—. ^. — dx. 

Hy ' x(i - x y-p 
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Now, the theorem follows since 

E[G k ] = Q £ g(x)x k (l - x) n - k dx = 6 Q £ x k -\l - x) n - k ~ 1+p dx 

_ 9 n- ■ ■ (n - k + 1) 
k (n — 1 + p) ■ ■ ■ (n — k + p) 

□ 

Proof of Theorem 5 based on an urn model. Let T be the King- 
man coalescent and Tq- be the tree-indexed Markov chain from Definition 
2.2. First, we focus on loss events for du C I along the random tree T. 
(We use the infinitesimal symbol du for notational convenience.) Since du is 
small, we may safely assume that there is at most one gene in du present in 
Ur=i Si- Gene loss events in du occur at constant rate ^ along each branch. 
Consider the tree T from the leaves to the root. Lines coalesce with pair 
coalescence rate 1, and any line hits a loss event in du at rate 5. Upon a loss 
event we kill the line off the tree. The resulting forest is well-known from the 
family decomposition in the infinite alleles model (e.g. Durrett, 2008, p. 14). 
Using Hoppe's urn, we can also generate the forest forward in time: consider 
an urn with one colored and one black ball. Choose the colored ball with 
probability proportional to 1 and the black one with probability propor- 
tional to p. When choosing a colored ball, put the chosen ball plus one ball 
of the same color into the urn. When choosing the black ball, put the black 
ball back together with a ball of a new color. In the next step, again choose 
any colored ball with probability proportional to 1 and the black balls with 
probability proportional to p. Proceed until there are n colored balls in the 
urn. Note that, given there are i colored balls in the urn, the chance that 

the next chosen ball is colored is ^7— = ttxty 2 — v, i-e. the chance equals 

the probability that two among i + 1 lines coalesce and are not killed off the 
tree by a gene loss event. 

To obtain the correct branch lengths in the tree, when there are i colored 
balls in the urn, wait an exponential time with rate |(i — 1 + p) until adding 
the next colored ball. This waiting time equals the time the coalescent stays 
with i lines, when pairs coalesce at rate 1 and single lines are killed at rate 
|. Hence, by this procedure, balls with the same color belong to the same 
tree in the forest and the time the forest spends with i lines is the same as 
viewing the coalescent backwards in time. 

So far, Hoppe's urn only described gene loss of the single gene u. Let 
us add gene gain of a gene in du to the description. During the evolution 
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of Hoppe's urn, which comes with its exponential waiting times, mark all 
colored balls at rate \du. When a marked colored ball is chosen, the added 
ball is again marked. Here, a mark stands for the presence of the considered 
gene along the corresponding ancestral line. Since du is small, there is at 
most one mark along the forest. 

For the forest given by the marked Hoppe's urn, we distinguish the times 
Ti,...,T n when there are 1, ...,n lines present. We say that line Z during T 
is of size k iff the ball belonging to this line produces exactly k — 1 offspring 
until the urn finishes. Hence, 

E[G&] = J E[du G Gi for exactly k different i] 

n i 

= ^2 51 IP[£th line during Tj is of size k] ■ 
i=i i=i 

J IPfmark in du on Zth line during T] 

and 

-du 6 

Plmark in du on Zth line during TA = — = — -du. 

1 J fCi-l + zo) i{i-l + p) 

Let us turn to the probability that the Zth line during T is of size k. The 
reasoning below is well-known from Polya urn models. When starting with 
i — 1 unmarked and one marked lines, there are (^Z\) possibilities at what 
times k — 1 marked balls are added when n — i balls are added to the urn in 
total. For any of these possibilities, the probability is 

(fc-l)!(z-l + g) •••(»- fc-l + p) 
(i + p) • ■ ■ (n - 1 + p) 

Putting everything together, 

E[G k ]=j2i( n ~^ {k ~ m ~ 1 + P) "' {n ~ k ~ 1 + P) ° 



k-ll (i + p) ■ ■ • (n - 1 + p) i(i - 1 + p) 



e k\ 



k (n — k + p) ■ ■ ■ (n — 1 + p) f-f I k 




_6 n- ■ ■ (n- k + 1) 
fc (n — fc + p) • • • (n — 1 + p) 

and we are done. □ 
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